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, The perennial problem of "how many clusters?" remains an issue 

' of substantial interest in data mining and machine learning commu- 

^ nities, and becomes particularly salient in large data sets such as 

populational genomic data where the number of clusters needs to be 
relatively large and open-ended. This problem gets further compli- 
cated in a co-clustering scenario in which one needs to solve mul- 
tiple clustering problems simultaneously because of the presence of 
common centroids (e.g., ancestors) shared by clusters (e.g., possible 
■ descents from a certain ancestor) from different multiple-cluster sam- 

' pies (e.g., different human subpopulations) . In this paper we present 

a hierarchical nonparametric Bayesian model to address this problem 
in the context of multi-population haplotype inference. 

Uncovering the haplotypes of single nucleotide polymorphisms is 
essential for many biological and medical applications. While it is un- 
common for the genotype data to be pooled from multiple ethnically 
, distinct populations, few existing programs have explicitly leveraged 

' the individual ethnic information for haplotype inference. In this pa- 

per we present a new haplotype inference program, Haploi, which 
makes use of such information and is readily applicable to genotype 
QQ ' sequences with thousands of SNPs from heterogeneous populations, 

, with competent and sometimes superior speed and accuracy com- 

paring to the state-of-the-art programs. Underlying Haploi is a new 
haplotype distribution model based on a nonparametric Bayesian for- 
, malism known as the hierarchical Dirichlet process, which represents 

}_( ' a tractable surrogate to the coalescent process. The proposed model 



(N 
> 
00 



(N 



Received August 2008; revised November 2008. 

^Supported by the National Science Foundation under Grant No. CCF-0523757 and 
by the Pennsylvania Department of Health's Health Research Program under Grant No. 
2001NF-Cancer Heahh Research Grant ME-01-739. 

^Supported by an NSF CAREER Award, under Grant No. DBI-054659 and an Alfred 
P. Sloan Research Fellowship in Computer Science. 

Key words and phrases. Dirichlet process, hierarchical Dirichlet process, haplotype in- 
ference, population genetics, mixture models, coalescence. 



This is an electronic reprint of the original article published by the 
Institute of Mathematical Statistics in The Annals of Applied Statistics, 
2009, Vol. 3, No. 2, 791-821. This reprint differs from the original in pagination 
and typographic detail. 

i 



2 



K.-A. SOHN AND E. P. XING 



is exchangeable, unbounded, and capable of coupling demographic in- 
formation of different populations. It offers a well-founded statistical 
framework for posterior inference of individual haplotypes, the size 
and configuration of haplotype ancestor pools, and other parameters 
of interest given genotype data. 

1. Introduction. Recent experimental advances have led to an explosion 
of data that document genetic variations within and between populations. 
For example, the International SNP Map Working Group (2001) has re- 
ported the identification and mapping of 1.4 million single nucleotide poly- 
morphisms (SNPs) from the genomes of four different human populations. 
These data pose challenging inference problems whose solutions could shed 
light on the evolutionary history of human population and the genetic basis 
of disease propensities [Chakravarti (2001), Clark (2003)]. 

SNPs represent the largest class of individual differences in DNA. A SNP 
refers to the existence of two possible nucleotide bases from {^4, C, G, T} at a 
chromosomal locus in a population; each variant, denoted as 1 or 0, is called 
an allele. A haplotype refers to the joint allelic identities of a contiguous list 
of polymorphic loci within a study region on a given chromosome. Diploid 
organisms such as human beings have two haplotypes in each individual, 
one maternal copy and one paternal copy. When the parental chromosomes 
come in pairs, two haplotypes go together and make up a genotype which 
consists of the list of allele-pairs at every locus. More precisely, a genotype 
is resulted from a pair of haplotypes by omitting the information regarding 
the specific association of each allele with one of the two chromosomes — its 
phase, at every locus. The problem of haplotype inference, which is the fo- 
cus of this paper, concerns determining which phase reconstruction among 
many alternatives is more plausible. Common biological methods for assay- 
ing genotypes typically do not provide phase information for individuals 
with heterozygous genotypes at multiple loci. Although phase can be ob- 
tained at a considerably higher cost via molecular haplotyping [Patil et al. 
(2001)], or sometimes from analysis of trios [Hodge, Boehnke and Spence 
(1999)], it is desirable to develop automatic and robust in silico methods for 
reconstructing haplotypes from the inexpensive genotype data. 

Key to the inference of individual haplotypes based on a given genotype 
sample is the formulation and tractability of the marginal distribution of 
the haplotypes of the study population. Consider the set of haplotypes, de- 
noted as H = {hi,h2, ■ ■ ■ , h2n}, of a random sample of 2n chromosomes of n 
individuals. Under common genetic arguments, the ancestral relationships 
among the sample back to its most recent common ancestor (MRCA) can 
be described by a genealogical tree known as the coalescent; computing the 
P{H) involves a marginalization over all possible coalescent trees compatible 
with the sample, which is widely known to be intractable. Li and Stephens 
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(2003) suggested to approximate P{H) by a Product of Approximate Con- 
ditionals (PAC). The PAC model tries to incorporate a desirable evolution 
assumption known as the parental- dependent- mutation (PDM) by model- 
ing each hi as the progeny of a randomly-chosen existing haplotype, and 
it forms the basis of the PHASE program, which has set the state-of-the- 
art benchmark in haplotype inference. However, the PAC model implicitly 
assumes existence of an ordering in the haplotype sample, therefore, the 
resulting likelihood is not exchangeable as one would expect for the true 
P[H). Moreover, since PAC involves no explicit ancestral genealogy over 
existing haplotypes, certain latent demographic information such as found- 
ing haplotypes and their mutation rates are not directly captured in the 
model. 

The finite mixture models represent another class of haplotype models 
that rely very little on demographic and genetic assumptions of the sam- 
ple [Excoffier and Slatkin (1995), Niu et al. (2002), Kimmel and Shamir 

(2004) , Zhang, Niu and Liu (2006)]. Under such a model, haplotypes are 
treated as latent variables associated with specific frequencies, and the hap- 
lotype inference problem can be viewed as a missing value inference and 
parameter estimation problem, for which numerous statistical inference ap- 
proaches have been developed, such as the maximum likelihood approaches 
via the EM algorithm [Excoffier and Slatkin (1995), Hawley and Kidd (1995), 
Long, Williams and Urbanek (1995), Fallin and Schork (2000)], and a num- 
ber of parametric Bayesian inference methods based on Markov chain Monte 
Carlo (MCMC) samphng [Niu et al. (2002), Zhang, Niu and Liu (2006)]. 
However, this class of methods has rather severe computational require- 
ments in that a probability distribution must be maintained on a (large) set 
of possible haplotypes. Indeed, the size of the haplotype pool, K, which re- 
flects the diversity of the genome, is unknown for any given population data 
and needs to be inferred. There is a plethora of combinatorial algorithms 
based on various hypotheses, such as the "parsimony" principles that offer 
control over the complexity of the inference problem [see Gusfield (2004) 
for an excellent survey]. On the other hand, most current methods based 
on statistical inference employ computationally expensive techniques such 
as cross validation or model-selection to address the issue of ancestral-space 
uncertainty [Scheet and Stephens (2006)]. 

Indeed, the uncertainty regarding the size of the haplotype pool is an 
instance of the perennial problem of "how many clusters?" in the clustering 
literature. The problem is particularly salient in large data sets where the 
number of clusters needs to be relatively large and open-ended — exactly 
the scenario in population genomic analysis. In Xing, Sharan and Jordan 
(2004) we have proposed a nonparametric Bayesian model, specifically the 
Dirichlet process (DP) [Blackwell and MacQueen (1973), Ferguson (1973)], 
which provides a prior and posterior distribution for mixture models with 
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unbounded numbers of mixture components. Recently, substantial efforts 
have also been made to speed up haplotype inference on large scale data. 
Notable programs include Beagle [Browning and Browning (2007)], which 
uses a localized haplotype model based on variable-length Markov chains, 
and MACH [Li and Abecasis (2006)], a fast version of PHASE. 

These progresses notwithstanding, it is noteworthy that most progresses 
being made so far on approximating the P{H) and K, and on dealing with 
long SNP sequences, do not explicitly leverage the potentially useful side in- 
formation such as the genetic origins of individuals in a population sample. In 
particular, statistical models developed so far are inadequate for addressing 
the multi-population haplotype sharing problems concerned in this paper. 
Consider, for example, a genetic demography study, in which one seeks to 
uncover ethnic- or geographic-specific genetic patterns based on a sparse cen- 
sus of multiple populations. In particular, suppose that we are given a sam- 
ple that can be divided into a set of subpopulations, for example, African, 
Asian and European. When conducting haplotype inference on such data, 
we may not only want to discover the sets of haplotypes within each sub- 
population, but also which haplotypes are shared between subpopulations, 
and what their frequencies are. Empirical and theoretical evidence suggests 
that an early split of an ancestral population following a populational bot- 
tleneck (e.g., due to sudden migration or environmental changes) can lead 
to subpopulation-specific genetic diversity, which causes ancient haplotypes 
(that have higher variability) to be shared among different subpopulations, 
and unique modern haplotypes (that are more strictly conserved) to be in- 
stantiated and inherited in different subpopulations [Pritchard (2001)]. This 
structure is analogous to a co-clustering scenario in which different groups 
comprising multiple clusters may share clusters with common centroids (e.g., 
different news topics may share some common key words). The implication 
of this phenomenon on haplotype reconstruction has not been thoroughly 
investigated. 

A naive solution to the aforementioned problem would be to infer haplo- 
types separately in the subpopulations. This is clearly suboptimal, however, 
because it may unnecessarily fragment the data, and may lead to unrobust 
estimation of demographic parameters. In particular, for rare haplotypes 
that are present in a small number of individuals (e.g., one or two) in each 
population but overall still have many bearers across all populations, the 
estimation of their founders (i.e., the centroid) should take into account of 
these bearers in all populations jointly, rather than being based on each pop- 
ulation separately. Essentially, what we want is a model to solving multiple 
clustering problems simultaneously. In this paper we describe a new hap- 
lotype model based on a hierarchical Dirichlet process (HDP) [Teh et al. 
(2006), Xing et al. (2006)], which directly address this issue. An HDP over 
a measurable space {^,B) specifies a set of coupled random distributions 
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{Qi, Q2, • • • , Qj} on <I> = £: X ^, where £ = [0, 1] and A = {0, 1}^ denote the 
space of the mutation rates and joint allele configurations, respectively, of 
the ancestral haplotypes of T SNP loci. Each Qj is a population-specific 
Dirichlet process (DP) [Blackwell and MacQueen (1973), Ferguson (1973)] 
which defines a nonparametric prior over the ancestral haplotypes and their 
frequencies of being inherited within the population, and thereby induces 
a Dirichlet process mixture (DPM) model [Antoniak (1974)] for all the 
individual haplotypes in that population. As detailed in Section 2, to allow 
every ancestral haplotype in a particular population to also have nonzero 
probability of being inherited in a different population (albeit with different 
frequencies), a hyper-prior Qq, which is also a Dirichlet process and there- 
fore discrete on is used to define the base measures of each population- 
specific Qj, ensuring that they are all realized on a common set of sup- 
ports (i.e., ancestors) in <I>. Our model differs from other methods reviewed 
earlier in the following ways: (1) Instead of resorting to empirical assump- 
tions or model selection over the number of population haplotypes, we in- 
troduce a nonparametric prior over haplotype ancestors, which facilitates 
posterior inference of the haplotypes in an "open" state space accommodat- 
ing arbitrary sample size. (2) Our model explicitly exploits the subpopula- 
tion labels and potentially latent genetic demographic structures to improve 
haplotyping accuracy. (3) Our model captures similar genetic properties as 
those emphasized in Stephens, Smith and Donnelly (2001), including the 
parent-dependent-mutations, but with an exchangeable likelihood function. 

We have developed an efficient MCMC-based software program Haploi, 
based on our proposed model, and using a variant of the Partition-Ligation 
scheme by Niu et al. (2002) to handle complexity explosion due to long 
input sequence. It can be readily applicable to multi-population genotype 
sequences, at a time-cost often at least two-orders of magnitude less than 
that of the state-of-the-art PHASE program, with competitive performance. 
We also show that Haploi can significantly outperform other popular hap- 
lotype inference algorithms on both simulated and real short SNPs data. 

2. The statistical model. Our narration below starts with a basic Dirich- 
let process mixture model for a simple demographic scenario, where we ig- 
nore individual subpopulation labels and assume absence of recombination 
in the sample. Then we describe the hierarchical Dirichlet process mixture 
for haplotypes from multiple populations in detail. There is an interesting 
connection of the DPM-based methods to the Wright-Fisher model and 
Kingman's coalescent with an infinitely-many-alleles (IMA) mutation pro- 
cess for allele evolution, which we will briefly discuss. 

2.1. Dirichlet process mixture for haplotypes. A random probability mea- 
sure Q on a measurable space {^,B) is generated by a Dirichlet process 
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DP(t, Qo) if for every measurable partition Bi, . . . , oi the sample space 
the vector of random probabilities Q{Bi) follows a finite dimensional Dirich- 
let distribution: (Q(-Bi), . . . , Q(Sfc)) ~ T>\y(tQq{Bi), . . . ,rQo(5fc)), where r > 
denotes a scaling parameter and Qo denotes a base measure defined on 
{^,B) [Ferguson (1973)]. 

Samples from a DP tend to cluster around the distinct-valued atoms, 
which offers a salient way for us to group haplotypes around common ances- 
tors. This property is best reflected in the constructive definition of the DP 
based on the following Polya urn scheme [Blackwell and MacQueen (1973)]. 
Having observed n samples with values ((/>!, . . . , (f)n) from DP(r, Qo)) the con- 
ditional distribution of the value of the (n + l)th sample is given by 

K 

(1) (/)„,+i I (/)!,. ..,(/)„,, r,Qo ~ Y\ — \ — ^4>i{') \ — '3o(-)) 

n + T n + T 

fc=i 

where denotes the number of samples with value 0^, and K denotes the 
number of unique values in the n samples drawn so far. This expression 
means that each new sample has positive probability of being equal to an 
existing unique value in the drawn samples, and, moreover, the probability 
is proportional to the occupancy number of the unique values, creating a 
clustering effect. The cluster cardinality is a random integer that is only 
bounded by the sample size, rather than being pre-specified. 

To model a haplotype population H that is genetically homogeneous, 
one can assume that H is originated from a size-unknown group of dis- 
tinct ancestral haplotypes (i.e., founders), and associate each unique value 
(j)*^ from a DP with a possible founder and its mutation probability, that 
is, {afc,0fc}. Relating every drawn sample (f)i to a modern individual hap- 
lotype via a conditional likelihood function, we arrive at a DP mixture 
model [Antoniak (1974), Escobar and West (1995)] for P{H) as described 
in Xing, Sharan and Jordan (2004), which is briefly recapitulated in the 
sequel for self-containedness and to introduce necessary notation for subse- 
quent exposition of the hierarchical DP mixture for multi-population data. 
Specifically, write Hi^ = [Hi^^i, . . . ^Hi^^t], where the sub-subscript e G {0, 1} 
denotes the two possible parental origins (i.e., paternal and maternal), for a 
haplotype over T contiguous SNPs from individual i, and let Gi = [Gj^i, . . . , 
Gi^x] denote the genotype of these SNPs of individual i. Let = [^/c,i, . . . , 
A^^t] denote an ancestor haplotype and 9^ denote the mutation rate of an- 
cestor k, and let Cj denote an inheritance variable that specifies the ancestor 
of haplotype Hi. Ph{H\A,6) represents the inheritance model according to 
which individual haplotypes are derived from a founder, and Pg{G\HQ, Hi) 
indicates the genotyping model via which noisy observations of the geno- 
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types are related to the haplotypes. A DPM defines the following generative 
scheme: 



• Draw first haplotype: 



sample the 1st haplotype from an inheritance model defined on the 1st 
founder; 
• for subsequent haplotypes: 

- sample the founder indicator for the ith haplotype: 



- sample the haplotype according to its founder: 

hi\ci ~ Ph{-\aci,Oc^). 

• sample all genotypes (according to a mapping between haplotype index i 
and allele index ig): 



Here, {a^jOk} corresponds to the set of mixture components, and the DP 
is used as the prior over the components in an unbounded ancestral space. 
This prior requires no specification of the size of the ancestor pool. 

2.2. Hierarchical DP mixture for multi-population haplotypes. Now con- 
sider the case where there exist multiple ethnic or geographic populations. 
Instead of modeling these subpopulations independently by unrelated DPMs, 
we place all the population-specific DPMs under a common prior, such that 
the ancestors in any of the population-specific mixtures can be shared across 



ai,^i|DP(r, Qo)~Qo(-) 
sample the 1st founder (and its mutation rate); 

hi ~ P/,(-|ai,6'i), 




where is the occupancy number of founder ■ 
sample the founder of haplotype i: 




gi\hio,hi-^ ~ Pg(-|/ijo, /ijj. 
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all the mixtures, but the weight of an ancestral haplotype in each mixture 
is unique. 

To tie population-specific DP mixtures together in this way, we employ 
a hierarchical DP (HDP) mixture model [Teh et al. (2006)], in which the 
base measures of the all population-specific DPMs admit a common dis- 
crete prior defined by another Dirichlet process DP(7,i<'). An HDP de- 
fines a distribution over a set of dependent random probability measures, 
{Qj, J = 1, . . . , J}, and another master random probability measure Qo that 
controls all the Qj's. Each Qj is a population specific DP with common (or 
population-specific) scaling parameter r, and a shared base measure defined 
by Qo- Moreover, Qo itself follows a Dirichlet process DP(7,i<'). Following 
a hierarchical Polya urn scheme, for rrij random draws = 4>j^i, ■ ■ ■ ,4'j,mj 
from Qj, we can derive the following conditional probability for {(j)mj \ <P-m ) 
[Xing et al. (2006)], where the subscript —rrij denotes the index set of all 
but the nijth sample: 



1^ T"^ ^<i>t y^^3 > 

f-^, rrii — l + T * ^ 

k=l ■> 



(2) + T— ^('^™.) 

raj — l + rn — 1 + 7 ^ 

K 

= '^'j.khl ^^rri, ) + ■^'j,K+lF{(Pni,), 
k=l 

where Uk denotes the number of samples under Qo drawn from the global 
measure F and equal to 0^, rrij^k denotes the number of samples in the 
jth group which are equal to 4>l, and tTj ,^ := "^J.fc+J^"*/(^^ ^ 7rj j^_^_i = 
^ '^1, „ Tj_^ • The vector vf' = (vr' i , vr' n, . . .) gives the a priori conditional 
probability of a new sample in group j. As shown later, this formula will be 
useful for implementing a Gibbs sampler for posterior inference under HDP 
mixtures. 

Based on the HDP described above, we now define an HDP mixture 
(HDPM) model for the genotypes in J populations. Elaborating on the 
notational scheme used earlier, let = [Gpj^ , . . . , G^y] denote the geno- 
type of T contiguous SNPs of individual i from subpopulation group j, and 
let H^^J = [H^^-^, . . . , H^'^Jrp] denote a haplotype of individual i from eth- 
nic group j. The basic generative structure of multi-population genotypes 
under an HDPM is as follows, which is also illustrated graphically in Fig- 
ure 1: 
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sample haplotype ie in population j; 




sample genotype i in population j, 



where the first three steps of sampling founder haplotypes follow the HDP 
scheme, the fourth step describes the mixture formulism, and the last step 
corresponds to the noisy genotyping model. Recall that in an HDP the base 
measure Qq is a random distribution of the pool of haplotype founders and 
their associated mutation rates. It ensures that all the population-specific 
child DPs can be defined on a common unbounded pool of candidate founder 
patterns. The child DPs place different mass distributions, that is, a priori 
frequencies of haplotype founders, on this common support, in a population- 
specific fashion. 

The base measure F in the above generative process is defined as a distri- 
bution from which haplotype founders (pk = {Ak,Ok} are drawn. Thus, it is 
a joint measure on both A and 9. We let F{A,9) = p{A)p{6) , where p{A) is 
uniform over all possible haplotypes and p{6) is a beta distribution introduc- 
ing a prior belief of low mutation rate. For generality, we assume A^^t and 
Hi^t of every single locus take values from an allele set A. For other building 
blocks of the HDPM model, we propose the following specifications. 

2.2.1. Haplotype inheritance model. Omitting all but the locus index t, 
we can define our inheritance model to be a single-locus mutation model as 
follows [Xing, Sharan and Jordan (2004)]: 



( 



6 



) 



(3) 



1^1 -1 
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Fig. 1. The haplotype-genotype generative process under HDPM, illustrated by an ex- 
ample concerning three populations. At the first level, all haplotype founders from different 
populations are drawn from a common pool via a Polya urn scheme, which leads to the 
following effects: 1. the same founder can be drawn by either multiple populations (e.g., 
the red founder in population 1 and 2, and the blue one in population 1 and 3), or only 
a single population (e.g., the grey founder in population 1); 2. shared founders can have 
different frequencies of being inherited. Then at the second level, individual haplotypes were 
drawn from a population- specific founder pool also via a Polya urn scheme, but this time 
through an inheritance model Ph(-\ak) that allows mutations with respect to the founders, 
as indicated by the underscores at the mutated loci in the individual haplotypes. Finally, 
genotypes are related to the haplotype pairs of every individual via a noisy channel Pg{-\-). 



where I(-) is the indicator function. This model corresponds to a star ge- 
nealogy resulting from infrequent mutations over a shared ancestor, and is 
widely used as an approximation to a full coalescent genealogy starting from 
the shared ancestor [e.g., Liu et al. (2001)]. 

Given this inheritance model, and under a beta prior Beta{ah, f3h) for the 
mutation rate 9, it can be shown that the marginal conditional distribution of 
a haplotype sample h = {hi^ : e G {0, 1}, i € {1,2, ... , I}} takes the following 
form resulted from an integration of 9 in the joint conditional: 

(4) p(h|a. c) = n M ^ ft. + ,^ ^ ) 1 , 

where R{ah,Ph) = r{ah)T{M ' = J2i,e,tKf^ie,t = Ofc,t)I(ci^ = A;) is the num- 
ber of alleles which are identical to the ancestral alleles, and = J2i e t ^i^ie 
afe^t)I(c«e = is the total number of mutated alleles. 

2.2.2. Genotype observation model. We assume that the genotype at a 
locus is determined by the paternal and maternal alleles of this site via the 
following genotyping model [Xing, Sharan and Jordan (2004)]: 

(5) p,(9i/i.o,i,/in,^e) =e'("=^^[w(i -e)]'^"^'^^[;^2(i 
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where h = hig^t © hj^^,t denotes the unordered pair of two actual SNP aUele 
instances at locus t; denotes set difference by exactly one element; "^2" 
denotes set difference of both elements, and fii and ^2 are appropriately 
defined normalizing constants. Again we place a beta prior Iieta{ag, (3g) 
on for smoothing. Under the above model specifications, it is standard 
to derive the posterior distribution of each haplotype Hi^ given all other 
haplotypes and all genotypes, and the posterior of any missing genotypes, 
by integrating out parameters or ^ and resorting to the Bayes theorem, 
which enables a collapsed Gibbs sampling step where necessary. 

2.2.3. Hyperprior for scaling parameters. To capture uncertainty over 
the scaling parameters, for example, 7, we use a vague inverse Gamma prior: 

(6) p(7^^) p(7) cx7"^exp(-l/7). 

In general, the probability density function of an inverse Gamma distribution 
with shape parameter l and scale parameter k is given as follows: 

k'' 

Under this prior, the posterior distribution of 7 depends only on the number 
of instances n, and the number of components but not on how the samples 
are distributed among the components: 

r7^ / II. N 7'-^exp(l/7)r(7) 

(7) p[-i\k,n)<:x — — ■ — ^ . 

r(n + 7) 

The distribution p(log(7)|/i;,n) is log-concave, so we may efficiently gen- 
erate independent samples from this distribution using adaptive rejection 
sampling [Rasmussen (2000)]. It is noteworthy that in an HDPM we need 
to define vague inverse Gamma priors also for the scaling parameters r of 
population-specific DPs at the bottom level. We use a single concentration 
parameter r for these DPs; it is also possible to allow separate concentration 
parameters for each of the lower-level DPs, possibly tied distributionally via 
a common hyperparameter. 

2.3. Posterior inference via Gibbs sampling. Based on the two-level Polya 
urn implementation of HDPM, an efficient MCMC algorithm, which is sim- 
ilar to the MCMC algorithms developed for DPM, can be derived to sample 
from the posterior associated with HDPM. Specifically, under a collapsed 
Gibbs sampling scheme where 6 and ^ are integrated out, the variables of 

interest are C^^J^, ^4^4, H^^J^, 7, and r, \/i,j,k,t,e. The sampler alternates 
between three coupled stages. First, it samples the scaling parameters 7 and 
r of the DPs, following the predictive distribution given by equation (7). 
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Then, it samples the cfj^s and afc,t's given the current values of the hidden 
haplotypes and the scaling parameters according to equations (8) and (9) 
(Appendix), respectively. Finally, given the current state of the ancestral 
pool, the ancestor assignment for each individual and the observed geno- 
types, it samples the h[-'\ variables according to equation (10). Details of 
the forms and derivations of the predictive distributions used for steps 2 and 
3 are given in the Appendix. 

2.4. Population genetic implication of DP-based haplotype models. There 
is an interesting connection between the Dirichlet process models and the 
well-known coalescent process theory underlying population genetic evo- 
lution [Kingman (1982)]. It can be shown that an infinitely-many- alleles 
(IMA) model with rate r/2 on an n-coalescent extends haplotype lineages on 
the coalescent tree according to the following law: with probability r / (n — 1 + 
r) , it instantiates a new haplotype, and with probability (n — l)/(n — 1 + r) , 
it replicates an existing haplotype lineage [Hoppe (1984)]. This is identical 
to the Polya urn scheme described in equation (1) with scaling parameters 
r and uniform base distribution over A, a Dirichlet process DP (r, Uniform). 

There is a mapping between the distinct founders (p^ = {ak,Ok},yk arising 
from a DP, to the novel haplotypes generated according to IMA on a coa- 
lescent tree at the birth of every new lineage. However, since these founders 
are independently draw, from the base measure, a basic DP cannot capture 
relationships between different founding haplotypes in a population. 

The parental-dependent-mutation model posits that, in a sequential gen- 
eration process of haplotypes, if the next haplotype does not match exactly 
with an existing haplotype, it will tend to differ by a small number of mu- 
tations from an existing one, rather than be completely different. Under a 
DP mixture, modern individual haplotypes hi are marginally dependent, 
because similar but nonidentical haplotypes can be grouped around possible 
founders according to an inheritance model Ph{H\A,6) that permits further 
changes on top on founders. As discussed later, this leads to an exchangeable 
P{H) that captures the effect of parent-dependent mutations. 

3. Partition-ligation and the Haploi program. As for most haplotype 
inference models proposed in the literature, the state space of the pro- 
posed HDPM model scales exponentially with the length of the genotype 
sequence and, therefore, it cannot be directly applied to genotype data con- 
taining hundreds or thousands of SNPs. To deal with haplotypes with a large 
number of linked SNPs, Niu et al. (2002) proposed a divide-and-conquer 
heuristic known as Partition-Ligation (PL), which was adopted by a num- 
ber of haplotype inference algorithms including PL-EM [Qin, Niu, and Liu 
(2002)], PHASE [Stephens, Smith and Donnelly (2001), Li and Stephens 
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(2003)] and CHB [Zhang, Niu and Liu (2006)]. We equipped the HDPM 
model with a variant of the PL heuristic, and present a new tool, Haploi for 
haplotype inference of multiple population genotype data over long SNPs 
sequences. 

The original PL-scheme in Niu et al. (2002) first divides the entire se- 
quence into disjoint short blocks and reconstructs haplotypes within each 
block. Then pairs of blocks are recursively ligated into larger (nonoverlap- 
ping) haplotypes via Gibbs sampling under a fixed-dimensional Dirichlet 
prior over the frequencies of the ligated haplotype in the product space (or 
a subset) of all the "atomistic haplotypes" of every pair of blocks. This 
bottom-up approach can recover haplotypes of every individual either hi- 
erarchically or progressively. However, this PL scheme does not scale well 
to long sequences because the number of possible haplotypes in the product 
space can quickly become intractable as the size of the nonoverlapping blocks 
to be ligated grows multiplicatively during the iteration. Unlike their ap- 
proach, our PL-scheme generates partially overlapping intermediate blocks 
from smaller blocks phased at the lower level. The pairs of overlapping blocks 
are recursively merged into larger ones by leveraging the redundancy of in- 
formation from overlapping regions, as well as overall parsimonious criteria. 
Empirically we found that this strategy can lead to a significant reduction 
of the size of the haplotype search space for long genotypes, and therefore 
facilitates a more efficient inference algorithm. 

Figure 2 outlines the PL-procedure adopted by Haploi, which can be 
divided into three steps. In step 1 we begin by partitioning given geno- 
type sequences into L short blocks of length T [e.g., T < 10 as suggested in 
Niu et al. (2002)]. Then we phase each atomistic block using the proposed 
HDPM (Figure 2, step 1). By doing this, we obtain all the individual haplo- 
types and also the population haplotype pool (i.e., founders) for each block. 
In the next step we ligate every pair of neighboring blocks. Naively the can- 
didate population haplotype pool for the ligated segment can be a Cartesian 
product of the haplotype pools in neighboring blocks. But such an uncon- 
strained product is in fact unnecessary. Since each individual harbors only 
two possible haplotypes within each block, for each pair of adjacent blocks, 
we can impute at most four new stitched haplotypes from an individual, but, 
in practice, we get much fewer because an individual can be homozygous on 
one or both blocks and the stitched haplotypes may have been imputed al- 
ready from earlier individuals; also, not all combinations of haplotypes in 
the two pools are necessary because some combinations may never exist in 
any individual. We pool such stitched haplotypes imputed from all individ- 
uals, which usually leads to only a small subset of the Cartesian product of 
the two haplotype pools. Then based on a finite dimensional Dirichlet prior 
over the candidate pool, we do Gibbs sampling as in Niu et al.'s PL scheme 
to obtain individual haplotypes for each overlapping 2T region. Essentially, 
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Fig. 2. The partition-ligation scheme used in Haploi. 



our procedure produces a more parsimonious set of population haplotypes 
by using an individual-based population haplotype imputation scheme. In 
addition, comparing to the ligation in Niu et al.'s scheme, we stitch every 
neighboring pair of blocks [ith with (i + l)th], whereas they ligate every odd 
numbered block with the next even numbered block [i.e., (2z — l)th with 
(2i)th]. In step 3 we hierarchically ligate overlapping adjacent blocks from 
the previous iteration, until the full sequence is covered (Figure 2, step 3). 
The ligation strategy is again different from that of Niu et al.'s due to the 
haplotype consistency constraints imposed by overlapping SNPs, which helps 
to reduce the candidate haplotype space of the merged blocks. More details 
about the entire partition-ligation process can be found in the Appendix. 

As we reduce the search space based on feasible individual haplotype 
pairs, there may be possibility of missing some haplotypes in the haplotype 
space construction if the ligation is only based on disjoint blocks. However, 
our ligation process considers two blocks with an overlapping region and 
takes into account all the possible inconsistencies for every heterozygous 
locus. Therefore, the actual number of haplotypes added to the space can 
be greater than four in general, except for the first pairwise ligation stage in 
step 2 (see the Appendix for a detailed example of this). Moreover, even in 
the pairwise ligation from the nonoverlapping atomic blocks, this risk can 
be reduced by considering every neighboring pair, not every odd-numbered 
and even-numbered pair as noted above, as the information in one block 
can be propagated into both side of neighbors and can be preserved better. 
Empirically, this new scheme led to a more accurate result than the original 
PL scheme with greatly improved computational cost, as the original PL 
scheme cannot be applied to more than a few hundred of the SNPs. 

The underlying intuition of our ligation procedure is to allow recombination- 
like transition on the overlapping regions for including not only all the nec- 
essary new haplotype configurations, but also to maximally preserve the 
haplotypes obtained at previous steps. This heuristic typically results in a 
population haplotype space of the merged block that is much smaller than 
the naive product-space of nonoverlapping lower-level blocks. Moreover, in- 
dividuals whose atomistic haplotypes of the pre-merged blocks have no dis- 
crepancy in the overlapping region would not only contribute very few but 
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high-confidence population haplotypes to the pool, but also they need not 
to be phased again in that ligation step. This constitutes the main source 
of efficiency and effectiveness of our algorithm. 

In summary, comparing to Niu et al.'s PL scheme, our method attempts to 
build a more parsimonious set of population haplotypes at each ligation iter- 
ation by using an individual-based population-haplotype imputation scheme 
that leverages haplotypic diversity constraints imposed by individual geno- 
types and overlapping blocks. However, these modifications only help to 
better trim the population haplotype space; statistically, it results in a near 
irreducible (due to restriction on the search space) but faster mixing Markov 
chain during haplotype sampling. 

4. Results. We evaluated the proposed HDPM model on both simulated 
genotype data and real genotype sequences from the International HapMap 
database. The haplotype inference accuracy under HDPM (via the Haploi 
program) is compared to that of the the baseline DP mixture model, and to 
PHASE 2.1.1 [Stephens, Smith and Donnelly (2001), Stephens and Scheet 

(2005) ], fastPHASE [Scheet and Stephens (2006)], MACHl.O [Li and Abecasis 

(2006) ] and Beagle 2.1.3 [Browning and Browning (2007)], in their default 
parameter settings unless otherwise specified. Two different error measures 
are used: err^, the ratio of incorrectly phased SNP sites over all nontriv- 
ial heterozygous SNPs, and dw, the switch distance, which is the number 
of phase flips required to correct the predicted haplotypes over all nontriv- 
ial cases. For short SNP sequences we primarily use err^, whereas for long 
sequences we compare according to common practice. In addition to hap- 
lotype inference, on the simulated data we also estimated other metrics of 
interest, such as the haplotype frequencies, the mutation rates 6 of each 
founding haplotypes and the number of reconstructed haplotype founders 
K, to assess the consistency of our model. 

4.1. Simulated multi-population SNP data. To simulate multi-population 
genotypes, we used a pool of haplotypes taken from the coalescent-based syn- 
thetic dataset in Stephens, Smith and Donnelly (2001), each containing 10 
SNPs, as the hypothetical founders; and we drew each individual's haplo- 
types and genotype by randomly choosing two ancestors from these founders 
and applying the mutation and noisy genotyping models described in the 
methodology section. For each of our synthetic multi-population data sets, 
we simulated five populations, each with 20 individuals. Each population is 
derived from 5 founders, where two of them are shared across all the popu- 
lations, and the other three are population-specific. Thus, the total number 
of founders across the five populations is 17. We test our algorithm on two 
data sets with different degrees of sequence diversity. In the conserved data 
set we set the mutation rate 6 to be 0.01 for all populations and all loci in 
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the simulation; in the diverse data set, 6 is set to be 0.05. Ah populations 
and loci are assumed to have the same genotyping error rate. Fifty random 
samples were drawn from both the conserved and the diverse data sets. 

4.1.1. Haplotype accuracy. We compare Haploi (i.e., HDP) and other 
methods applied in two modes on synthetic data. Given multi-population 
genotype data, to use DP or other extant methods, one can either adopt 
mode-I, pool all populations together and jointly solve a single haplotype 
inference problem that ignores the population label of each individual, or 
follow mode-II, apply the algorithm to each population and solve multiple 
haplotype inference problems separately. Haploi takes a different approach, 
by making explicit use of the population labels and jointly solving mul- 
tiple coupled haplotype inference problems. Note that when only a single 
population is concerned, or no population label is available, Haploi is still 
applicable and is equivalent to a baseline DP with one more layer of DP 
hyper-prior over the base measure. We compare the overall performance of 
Haploi on the whole data with other algorithms run in mode-I, and also the 
accuracy of Haploi within each population with those of other methods run 
in mode-II. Since fastPHASE can also take account of population labels, we 
supplied the labels to fastPHASE in mode-I experiments. 

We first test how much HDP can gain by the hierarchical structure on 
multiple populations compared to the baseline DP. Figure 3 compares the 
result of HDP with the baseline-DP in mode-I (denoted by DP-I) and that 
in mode-II (denoted by DP-II) on synthetic multiple populations. On both 
the conserved samples, which are presumably easier to phase, and the di- 
verse samples, which are more challenging, HDP significantly outperformed 
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Fig. 3. A comparison of HDP with the baseline DP on the synthetic multi-population 
data. DP-II; DP run on each separate population (mode-11). DP-I: DP run on a merged 
population (mode-1). The errors measured by site-discrepancies over 50 random samples 
are presented for (a) conserved datasets (9 = Q.Q1) and (b) diverse datasets (9 = Q.G^). 
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DP in both modes (with p = 0.0336 against DP-II on the conserved samples, 
and p < 1.83 x 10~^ in ah other comparisons, according to a paired t-test). 
In addition, as a basehne case, we apphed HDP to each single-population 
separately as DP in mode-II, assuming the scenario of a single population 
or individuals without population labels. Again, HDP applied to all popu- 
lations jointly outperformed this baseline HDP significantly, as the latter is 
deprived of the gain by information sharing. Moreover, this baseline HDP 
also dominates DP in mode-H significantly, especially on diverse datasets 
{p < 0.0017). It appears that the hierarchical structure of HDP which intro- 
duces a nonparametric hyper-prior over the base measure of a DPM allows 
more flexibility in the model and gives better performance than a plain DPM 
with fixed base measure. 

Figure 4 shows boxplots for the differences between the error rate of each 
benchmark algorithm and that of HDP (i.e., errjaig} — errnop)- Note that 
the regions above the horizontal line y = correspond to the cases where 
HDP outperforms others. When other algorithms are run in mode-I [Figure 
4(a)], Haploi outperforms all of them significantly on both the conserved 
and diverse samples (p < 8.9 x 10~^). Haploi remains competitive in com- 
parison with other methods when the latter are run in mode-II, that is, on 
each population separately [Figure 4(b)]. On the conserved data, PHASE 
shows the best result, but the differences between algorithms are not sig- 
nificant (p < 0.11). Whereas on the diverse data, Haploi outperforms other 
algorithms significantly {p < 0.0043). Again, all significant scores were com- 
puted according to a paired t-test. 
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Fig. 4. A comparison of HDP with other methods (fPh: fastPHASE, Ph: Phase, Ma: 
Mach, Be: Beagle) running in (a) mode-l, and (b) mode-II, on synthetic multi-population 
data. Boxplots for the differences between the error rate of each algorithm and that of HDP 
(i.e., errjaig} — erruop) are presented. 
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4.1.2. Parameter estimation and sensitivity analysis. Typically, with ran- 
dom initialization, the Gibbs sampler for Haploi converges within 1000 it- 
erations on the synthetic data. This contrasts sampling algorithms used in 
some of the other haplotype models, which typically need tens of thousands 
of iterations to reach convergence. The fast convergence is possibly due to 
Haplofs ability to quickly infer the correct number of founding haplotypes 
underlying the genotypes samples, which leads to a model significantly more 
compact (i.e., parsimonious) than that derived from other methods. 

Estimating K and 9. We compared the estimated K — the number of re- 
covered ancestors via both HDP and DP. Recall that we expect K to be 17. 
Overall, the estimated K under both the DP and HDP models turns out 
to be very close to this number on the conserved datasets; from the diverse 
data sets, HDP can still offer a good estimate of the number of ancestors, 
whereas DP recovered more ancestors (around 25 on average) than the true 
number. This is not surprising since a haplotype which appears in more than 
one population can have different frequencies in different populations, the 
baseline DP cannot capture such sub-population structure, and the higher 
divergence due to both mutation and population diversification can make it 
generate more ancestors to describe the given dataset. 

Our Gibbs sampler also provides reasonable estimates of the mutation 
rates of each haplotype founder. We observe that for the conserved data 
sets, HDP yields highly consistent and low variance estimations of 0, and 
the quality of the estimates due to DP is slightly worse. For the diverse data 
both algorithms tend to slightly underestimate the mutation rates, and the 
variance is also higher. It is noteworthy that, in principal, high haplotype 
diversity of a population can be explained by two competing sources: high 
mutation rate from ancestors to descendants, and large number of ancestors. 
Indeed, K and 9 cannot be independently determined, following a similar 
argument of the un-identifiability of the evolution time and population size 
under the lAM model. But empirically, HDP appears to strike a reasonable 
balance between K and 9, and offers plausible estimates of both. 

A more thorough sensitivity analysis with respect to the hyper-parameters 
in our model is detailed in Table 1. The proposed HDP model has two scale 
parameters, 7 and r, for the upper and lower level DP, which are under in- 
verse Gamma priors as discussed in Section 2.2.3. To see the sensitivity of the 
K and 9 estimations under different priors, we applied various values of hy- 
per parameters i and k (the same for both 7 and r) on one of the 50 random 
conserved datasets. Columns 4-9 in Table 1 show the number of recovered 
founders within each sub-population (the correct number is 5 for each), and 
the total number of distinct founders over all the populations. Overall, over 
a wide range of values for the hyper-parameters, Haploi gives low-bias and 
low-variance estimation of the number of founders of each sub-population 
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Table 1 

A sensitivity analysis to the hyper-parameters of HDP on a conserved dataset. Results 
with different hyper-parameters i and k for inverse Gamma prior are shown. The 
number of founders for each population (Ki) and the total number of ancestors across all 
the populations are shown in columns 4~9. The estimated mutation rate 6 and the 
haplotyping errors (errs) are also shown through columns 10-11. The sensitivity of 9 
estimate to the hyper prior is examined over a wide range of both different magnitudes 
(0.1 to 1000) and ratios (0.0001 to 10,000) of i and k 
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as well as the total number of distinct founders. In columns 10-11, we show 
the inferred mutation rate and the haplotyping error. Even when incorrect 
numbers of founders are recovered, the actual haplotyping errors are not 
significantly affected, which shows the robustness of the proposed approach 
for ahaplotype recovering application. The test on a diverse dataset shows 
a similar tendency while the result is slightly less stable (see supplementary 
material [Sohn and Xing (2008)] for more details). 

Estimating haplotype frequencies. Figure 5 summarizes the accuracy of 
population haplotype frequencies estimated by each algorithm. The discrep- 
ancy between the true frequencies and estimated ones is measured by the 
KL-Divergence -Dkl(pIIq') = SxPI^) log The top row shows the accu- 
racy of HDP along with those of DP in mode-II and in mode-I, and the 
bottom row shows the differences between the error rates of benchmark al- 
gorithms and those from HDP. The left column of Figure 5(a) reports Dkl 
computed on ALL haplotypes frequencies estimated by different algorithms 
from the conserved data sets and the right column of Figure 5(a) shows the 
result when measured only on the frequent haplotypes (i.e., with frequen- 
cies > 0.05). Comparing to the baseline-DP, HDP is as accurate when only 
frequent haplotypes are considered. When all the frequencies are consid- 
ered, however, the margin of HDP over DP becomes significant, especially 
on the diverse dataset (p = 0.0009). Overall, Haploi, PHASE and MACH 
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Fig. 5. A comparison of the accuracies of haplotype frequencies. Top: the result from 
HDP, DP in mode-II (DP-ll), and DP in mode-l (DP-1). Bottom: the relative error rates 
of four benchmark algorithms with respect to those from HDP. (a) Box-plots of Dkl 's 
estimated from the conserved data sets. Left column shows measurements on all haplotypes, 
right column shows measurements on only the frequent haplotypes. (b) Same measurements 
on the diverse datasets. 
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work equally well without significant difference in performance on conserved 
datasets. For more difficult diverse data sets [Figure 5(b)], HDP achieves 
the lowest discrepancy by a significant margin over all the other algorithms. 
The runner-up, PHASE, beats fastPHASE and MACH with a small margin. 
When measured only on the frequent haplotypes [i.e., the right column of 
Figure 5(b)], the discrepancies decrease significantly, but the relative order- 
ing of all the compared algorithms remains similar, except that now fast- 
PHASE outperforms PHASE {p = 0.0036). 

4.2. The HapMap data. We also test Haploi on both short SNP segments 
(i.e., ~6 SNPs) and long SNP sequences (i.e., ~ 10^-10^ SNPs) available from 
the International HapMap Project. This data contains SNP genotypes from 
four populations: Utah residents with ancestry from northern and western 
Europe (CEU); Yoruba in Ibadan, Nigeria (YRI); Han Chinese in Beijing 
(CHB); and Japanese in Tokyo (JPT), with 60, 60, 45 and 44 unrelated 
individuals, respectively. Although haplotype inference can be, and in some 
test scenarios, was performed on all populations, evaluation of the outcome 
is on only the CEPHs and Yorubas since the true haplotypes can be almost 
unambiguously deduced from trios only in these two populations. The indi- 
vidual genotypes that cannot be unambiguously phased from the trios were 
ignored in the scoring. We consider three different population-composition 
scenarios in our experiments below: (1) using all the four populations to- 
gether for haplotype inference (FourPop); (2) using only CEPH and Yoruba 
populations for inference (TwoPop); and (3) phasing CEPH and Yoruba 
separately (OnePop). Essentially, in the FourPop and TwoPop scenarios we 
solve a bigger haplotype inference problem on data that contain richer pop- 
ulation information. 

4.2.1. Short SNP sequences. Phasing short SNPs is the basic operation 
of large-scale haplotype inference problems which rely either on partition- 
ligation heuristics or on model-based methods, such as recombination pro- 
cess, to integrate short phased haplotype segments into long haplotypes. 
Figure 6 shows a comparison of the phasing accuracy on 6-SNP segments 
[following a recommendation in Niu et al. (2002) on the optimal size-range 
of basic units for subsequent ligation] by four algorithms. The test was done 
on randomly selected 100 sets of 6-SNPs segment from chromosome 21. For 
each of the three population-composition scenarios, we applied all meth- 
ods to different population sizes, that is, 60, 30, 20 and 10 individuals per 
population, to examine the effect of population size on phasing accuracy. 

Several aspects of Haplofs performance on real data are revealed by Fig- 
ure 6. First, comparing the performances of Haploi under the three different 
population-composition scenarios, we observe that Haploi improves steadily 
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Fig. 6. A comparison of the haplotyping error on the CEPH+ Yoruha population over 
randomly chosen 100 sets of 6-SNP segments from Chromosome 21. The results were 
obtained under three population-composition scenarios: (i) FourPops: when data from all 
the four populations were used (blue) for inference; (ii) TwoPops: when data from CEPH 
and Yoruha populations were used together (green); (iii) OnePop: when each of the CEPH 
and Yoruba populations was used separately (gray). Different sample sizes, with 60, 30, 20 
and 10 individuals per each population, were used. 



as more populations are included in haplotype inference, and the improve- 
ments are statistically significant. The p-values of the differences between 
FourPop and OnePop scenarios are 0.00024, 0.000038, 0.0016 and 0.000022 
for data with 60, 30, 20 and 10 individuals per population, respectively; 
and the p- values of the margins of TwoPop over OnePop are 0.0014, 0.0002, 
0.0053 and 0.00047, respectively, in the same order. The improvement in 
FourPop over TwoPop is less significant, with p-values 0.35, 0.11, 0.16 and 
0.023, respectively, suggesting that the possible gain in haplotype accuracy 
enabled by the HDP model via exploring shared information among popu- 
lations can be capitalized the most when we change from single-population 
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inference to joint-inference in multiple population, whereas the effect of hav- 
ing more populations in the multi-population scenario appears to be less 
obvious in this dataset. 

Second, comparing the performances of Haploi under different population 
sizes, we observe that the performance-gain through information sharing 
among populations tends to be greater when the population sizes decrease. 
For example, the performance differences of Haploi in multi-population over 
single-population become most significant when the number of individuals 
per population is the smallest (7^ Individual per pop = 10). This observation 
suggests that HDP is especially advantageous under data scarcity situations 
where information from each population becomes insufficient to warrant 
reliable inference within the population. 

Third, other methods, such as PHASE, MACH and Beagle, appear not 
able to benefit from increased population diversity as indicated by the sig- 
nificant drop of their accuracies when more populations are involved. The 
performance of fastPHASE (with known population labels) improves sub- 
stantially when two populations are used together, while the performance 
becomes slightly worse in the case of four populations. Comparing the results 
from the most preferred scenario of each algorithm, that is, Haploi under 
FourPop, fastPHASE under TwoPop, and all the others under OnePop, Hap- 
loi and PHASE worked similarly well when all the available data were used 
(i.e., ^Individual per pop = 60), with mean error rate of each algorithm at 
0.0174, 0.0198, 0.0173, 0.0229 and 0.0222, respectively (withp= 0.05, 0.89, 
0.10, 0.01 over differences of Haploi with other algorithms). When the pop- 
ulation sizes decrease, Haploi starts to surpass others more substantially, 
and works more reliably than others. For example, on 10 individuals per 
population, the mean error rates of the five algorithms were 0.0424, 0.0460, 
0.0512, 0.0777 and 0.0945, and the p-values of the margin of Haploi over 
others are 0.17, 0.02, 1.2 x 10^ 6.7 x 10^ respectively. 

4.2.2. Long SNP sequences. Finally we test Haploi on very long geno- 
type sequences with 10^ ~ 10^ SNPs. We selected 10 ENCODE regions from 
the HapMap DB, each spanning roughly 500 Kb and containing from 254 to 
972 common SNPs across all four populations (see supplementary material 
[Sohn and Xing (2008)] for more details). We performed haplotype infer- 
ence under three different population-composition scenarios as before, but 
due to the extremely high cost in computational time in these experiments, 
we only worked on the full-size data sets. Figure 7 shows a comparison of 
haplotype reconstruction quality, using PHASE, fastPHASE, MACH, Bea- 
gle and Haploi equipped with the PL heuristic. Out of the 30 experiments 
we performed (10 regions and three scenarios), the PHASE program failed 
to yield results in 5 experiments after a 31-day runtime, so we omit the 
corresponding results in our summary figure. 
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The conclusion from Figure 7 is less clear than the ones from previous 
sections from experiments on short SNP sequences and on simulation data. 
Overall, Beagle dominates all the algorithms with a small margin, PHASE 
also shows comparable result to Beagle when converged, but all the other 
algorithms work comparably in most cases across different datasets and dif- 
ferent scenarios. In terms of computational cost. Beagle was the fastest, it 
took less than a minute for each task; fastPHASE and MACH mostly took 
less than 1 hour for each task, Haploi took from 1-10 hours, depending on 
the length of the sequence, whereas PHASE took one to two orders of mag- 
nitude longer, and was indeed impractical for phasing very long sequence. 

In summary, our result shows that Haploi is competent and robust for 
phasing long SNP sequences from diverse genetic origins at reasonable time 
cost, even though it has not yet employed any sophisticated way for pro- 
cessing long sequences, such as the recombination process. Since Haploi ap- 
peared to outperform other methods over short SNPs, we believe that the 
competence of Haploi on long SNPs is due to a better inference power en- 
dowed by the HDP model for multi-population haplotypes, and we expect 
that an upgrade that incorporates explicit recombination models in conjunc- 
tion with HDP for long SNPs is likely to lead to more accurate haplotype 
reconstructions . 



5. Discussion. We have proposed a new Bayesian approach to haplotype 
inference for multiple populations using a hierarchical Dirichlet process mix- 
ture. By incorporating an HDP prior which couples multiple heterogeneous 
populations and facilitates sharing of mixture components (i.e., haplotype 
founders) across multiple Dirichlet process mixtures, the proposed method 
can infer the true haplotypes in a multi-subpopulation dataset with an ac- 
curacy superior to the state-of-the-art haplotype inference algorithms. 
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Fig. 7. Performance on the full sequences of the selected ten ENCODE regions, (a) Error 
rates under four-population scenario, (b) Under the two-population scenario, (c) Under 
the one-population scenario. For cases of which the program does not converge (NC) within 
a tolerable duration (i.e., 800 hours), we cap the bar with a to indicate that the results 
are not available (NA). 
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Recently, there emerged new models related to our HDP model, the closest 
being the nested Dirichlet process (NDP) by Rodriguez, Dunson and Gelfand 
(2006). In an NDP, instead of using a hyper-DP as a common base measure 
as in HDP to allow sharing of founders across populations, the population- 
specific DPs are directly drawn from a prior DP, so that not only the 
founders, but also their frequencies can be shared across populations. Al- 
though this model can be more expressive in many applications, it may be 
less appropriate than HDP for multi-population haplotype problems where 
excessive structural sharing across populations are not warranted, especially 
when different populations bear very distinct demography and genetic proto- 
types. Another strategy proposed by Muller, Quintana and Rosner (2004) 
employs an explicit stochastic convex combination of a population-specific 
prior and a universal prior for each founder. Under such a model, once a 
founder is destined to be shared across populations, it will appear with 
equal frequency in all populations. HDP subsumes this scenario, but also 
allows more flexible sharing of the founders. 

The proposed model achieves the desirable properties of PAC regard- 
ing mutation dynamics [Li and Stephens (2003)], including the parental- 
dependent-mutation effect, albeit in a very different way. For example, to 
see the PDM property, note that when the next haplotype is to be sampled 
according to equation (1), we pick an ancestor of some previously drawn 
haplotypes, and apply a mutation process to the ANCESTOR (rather than 
to one of the previously drawn haplotypes as in PAC). This operation im- 
plicitly results in a PDM effect among haplotypes by relating them to their 
corresponding founder via a tractable star genealogy equipped with a com- 
mon mutation process Ph{-\founder) . A new haplotype generated from this 
process will bear mutations over its corresponding founders rather than be- 
ing completely random. Above these founders, we model their genealogy and 
type history by a coalescent-with-IMA model, whose resulting marginal is 
equivalent to that of the Dirichlet process. Here a new founder can be sam- 
pled independently of the type-history in the coalescent from the base mea- 
sure, rather than according to a PDM, with probability proportional to the 
IMA mutation rate. Putting everything together, the DP mixture model es- 
sentially implements a combination of IMA and PDM: it models the geneal- 
ogy and type history of hypothetical ancestors presumably corresponding to 
a bottleneck with a coalescent-with-IMA model; below the bottleneck, it uses 
multiple (indeed, can be countably infinite many) star genealogies rooted 
at the ancestors present in the bottleneck and equipped with ancestor- 
dependent Poisson mutation process, to approximate the coalescent-with- 
PDM model. The time of the bottleneck depends on the value of the scaling 
parameter a of the DP. One can introduce a prior to this parameter so that 
it can be estimated a posteriori from data. 
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It is well known that under Kingman's n-coalescent, a dominant portion of 
the depth of the coalescent tree is spent waiting for the earliest few lineages 
to coalesce to the MRCA and the majority of lineages of even a very large 
population can actually coalesce very rapidly into a few ancestors, which 
means that the net mutation rates from each of these ancestors to their de- 
scendants in a modern haplotype sample do not vary dramatically among the 
descendants. Thus, qualitatively, a star genealogy provides a reasonable ap- 
proximation to the actual (heavily time-compressed) genealogy of a modern 
haplotype sample up to these ancestors. As a reward of such approximation, 
a well-known property of DP mixture is that it defines an exchangeable dis- 
tribution of the samples. Furthermore, the Polya urn construction of DP 
enables simple and efficient Monte Carlo for posterior inference of haplo- 
types and other parameters of interest, and the DPM formalism offers a 
convenient path for extensions that capture more complex demographic and 
genetic scenarios of the sample, such as the multi-population haplotype dis- 
tribution as we explored in this paper. 

Unlike the models underlying PHASE and fastPhase, the PL heuristic 
used in the Haploi program does not explicitly model the recombination 
process that shapes the LD patterns of long SNP sequences. Since an HDP 
model without the aid of PL-scheme dominates PHASE and fastPhase over 
short SNPs, we believe that an upgrade that incorporates an explicit recom- 
bination model in conjunction with HDP is likely to lead to more accurate 
reconstruction of long haplotypes. The hidden Markov Dirichlet process re- 
cently developed by us to model recombination in open ancestral space offers 
a promising path for such an upgrade [Xing and Sohn (2007)]. Under the 
proposed statistical framework for modeling haplotype and genotype distri- 
bution, it is also straightforward to handle various missing value problems 
in a principled way. In another possible extension, although in the present 
study we have assumed that the subpopulations' labels of individuals are 
known, it is straightforward to generalize our method to situations in which 
the subpopulations' labels are unknown and to be inferred. This opens the 
door to applications of our method to large-scale genetic studies involving 
joint inference over markers and demography. The HDP model is also a 
natural formalism for applications outside of population genetics, such as 
in text modeling, where one can use an HDPM to model co-clustering of 
documents from different journals (analogous to different populations here) 
according to both shared and unique topics defined by, for example, a latent 
Dirichlet allocation model [Blei, Ng and Jordan (2003)], and also in net- 
work modeling, where the neighbor profiles of every node can be modeled 
by a low-level DPM whose likelihood function is defined by, for example, a 
mixed membership stochastic block model [Airoldi et al. (2006)], and the 
entire network corresponds to an HDP over all nodes. Due to space limits, 
a detailed description of our work in these applications is beyond the scope 
of this paper. 
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APPENDIX A: MARKOV CHAIN MONTE CARLO FOR HDP 

In this section we describe a Gibbs sampling algorithm for posterior infer- 
ence of haplotypes under the HDPM model. We start with a brief description 
of the HDP formalism in terms of Polya urn models. Imagine we set up a 
single "stock" urn at the top level, which contains balls of colors that are 
represented by at least one ball in one or multiple urns at the bottom level. 
At the bottom level, we have a set of distinct urns which are used to define 
the DP mixture for each population. Now let us suppose that upon drawing 
the nijih ball for urn j at the bottom, the stock urn contains n balls of K 
distinct colors indexed by an integer set C = {1,2, ... ,K}. Now we either 
draw a ball randomly from urn j, and place back two balls both of that 
color, or with some probability we return to the top level. From the stock 
urn, we can either draw a ball randomly and put back two balls of that color 
in the stock urn and one in j , or obtain a ball of a new color K + 1 with 
probability and put back a ball of this color in both the stock urn 

and urn j of the lower level. Essentially, we have a master DP (the top urn) 
that serves as a source of atoms for J child DPs (bottom urns). 

Associating each color k with a random variable (pk whose values are 
drawn from the base measure F, we know that draws from the stock urn 
can be viewed as marginals from a random measure distributed as a Dirichlet 
Process Qq with parameter (7,^). From equation (1), for n random draws 
(f) = {(pi, . . . ,(pn} from Qo, the conditional prior for ((/>„,|(^_„), where the 
subscript "— n" denotes the index set of all but the nth ball, is 

K 

~ r- — %.('/'n) H — — F{cl)i), 

^ n — 1 + 7 '' n — 1 + 7 

where (p^jk = 1, . . . , K, denotes the K distinct values (i.e., colors) of (f) (i.e., 
all the balls in the stock urn), and denotes the number of balls of color 
k in the top urn. 

Conditioning on Qq (i.e., using Qq as an atomic base measure of each 
of the DPs corresponding to the bottom-level urns), the rn^th draws from 
the jth bottom-level urn are also distributed as marginals under a Dirichlet 
measure which leads to the distribution shown in equation (2). 

This nested Polya urn scheme motivates an efficient and easy-to-implement 
MCMC algorithm to sample from the posterior associated with HDPM. Re- 
call that the mixture components (pk correspond to the ancestral haplotypes 

with their mutation rates, and the samples correspond to individual hap- 

(7) (7) 

lotypes. Therefore, the variables of interest are 0^,4, hfj^, clj^, 7 and r, 

and gpj'* (the only observed variables). We may assume that the represented 
mixture components are indexed by 1,...,K, the weights of the founders 
at the top level DP is /3 = ( , . . . , "5. , — 2_) where — 2_ ig the 

^ *-n— 1+7' ' n— 1+7 ' n— 1+7 ' ' n— 1+7 



28 



K.-A. SOHN AND E. P. XING 



total weight corresponding to some unrepresented founder K + 1, and the 
weights of founders at the bottom-level DP for, say, the j'th population, are 
(t^^^, T^^p^^ T^[pT+^)^ where ^^^-^ corresponds to the probabil- 
ity of consulting the top-level DP. The Gibbs sampler alternates between 
three coupled stages. First, we sample the scaling parameters 7 and r of the 

DPs according to equation (7). 

(7) 

Then, we sample the c] and a^^t given the current values of the hidden 

haplotypes and the scaling parameters. Before sampling c , we first erase 

" (7) 

its contribution to the sufficient statistics of the model. If the old c) was 
k' , set rrijf^' = rrijy — 1. If it was sampled from the top level DP, we also set 

nk' = nk' — 1. Note that afj < K + 1 (i.e., indicating existing founders, plus 

a new one to be instantiated). Now we can sample c['^J from the following 
conditional distribution: 

p(c5f) = A:|c[-^'*=l,h,a) 

(8) ocp{4i^ = k\c^-^^'^\m,n)p{h\i^\ak,cM~^^''^) 

oc (mj^''^! + rPk)p{h^ lafc, 4"''''=') for A: = 1, . . . , + 1, 

in group j, and rrij^K+i = 0; \ ^ denotes the sufficient statistics associated 

with all haplotype instances originating from ancestor k, except . If, as 
(7) 

a result of sampling , a formerly represented founder is left with no hap- 
lotype associated with it, we remove it from the represented list of founders. 
If, on the other hand, the selected value k is not equal to any other existing 
index , that is, cfj = K +1, we increment K hy 1, set nx+i = 1, update 
P accordingly, and sample ax+i from its base measure F. 

Now, from equation (4), we can use the following posterior distribution 
to sample ak'. 

p{ak,t\c,h) (X II p{h\^J^t\ak,t,lk}) 

(9) 



where mij'^"^ represents the number of c^.P that are equal to k, except cj"''* 



j.ie\c'\^\ = k 



naf, + lk,t)T{Ph + l'k,t) . , 
—j-R{ah, Ph) 

r{ah + /3h + mk){\A\-iy>'.^ 



where Ik^t is the number of allelic instances originating from ancestor k at 
locus t across the groups that are identical to the ancestor, when the ancestor 
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has the pattern ak,t- If k was not represented previously, we can just use zero 
values of lk,t, which is equivalent to using the probability p{a\hl^J). 

We now proceed to the third sampling stage, in which we sample the 
haplotypes h/J , given the current state of the ancestral pool and the an- 
cestral haplotype assignment for each individual, according to the following 
conditional distribution: 

-"^^ r(a, + /3, + /J) 

X Kh 



r(a,, + /3ft, + n,.)(|^|-iP'---* 

where k' = cjf^ ^ = + I{hlll = ^""^ ^[-L*] ^'^^ 

sufficient statistics recording the inconsistencies between the haplotypes and 
genotypes in population j. 

APPENDIX B: DETAILS OF THE PL PROCEDURE 

This section describes the detailed procedure of partition-ligation algo- 
rithm used in Haploi, which can be divided into three steps: (1) atomic block 
typing; (2) bottom-level pairwise ligation to generate overlapping blocks; and 
(3) hierarchical ligation of overlapping blocks until only one block is left. In 
step 1, we partition given genotype sequences into L short blocks of length 
T and phase each atomistic block using the proposed HDPM. Prom this 
step, we obtain all the individual haplotypes and also the population hap- 
lotype pool for each block. Let Aj = {Af. (^i_i-^j'j^i . j^l A: = 1, . . . , K]"} denote 
the population haplotype pool for T SNPs in the ith block which ranges 
from locus (i — 1)T + 1 to iT. 

In the next step, we ligate every pair of neighboring blocks: AfSzAf_^i 
Aj^ ,i = 1, . . . ,L — 1. Specifically, for each pair of neighboring blocks i and 
i + 1, given Ai[ and A^^i, we can impute at most four new stitched haplo- 
types from an individual since each individual has only two possible haplo- 
types within each block. In practice, we often have fewer because an indi- 
vidual can be homozygous or the stitched haplotype may already have been 
imputed from earlier individuals. We pool such stitched haplotypes from all 
the individuals to form Aif^ , which usually leads to only a small subset of 
A^l^ X Ai^j^i- Then based on a finite dimensional Dirichlet prior over Aif^ , we 
do Gibbs sampling as in Niu et al.'s PL scheme to obtain individual hap- 
lotypes for each overlapping 2T region. To compensate possible ill-ligated 
blocks, we can redo the direct haplotype inference based on HDPM on those 
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merged blocks whose entropy of haplotype distribution is above some thresh- 
old (Figure 2, step 2-1). This is computationally affordable since the length 
of the ligated block at this stage is not yet too big and we can start with 
better initialization than random assignment. The output from step 2 is 
L — 1 sets of length 2T population haplotypes, {^f^ : i = 1, . . . , L — 1}, over- 
lapping on T loci for each adjacent pair, and all individual haplotypes in 
these length 2T overlapping segments. 

In step 3, we hierarchically ligate overlapping adjacent blocks from the 
previous iteration, until the full sequence is covered (Figure 2, step 3). Specif- 
ically, as in step 2, we build the candidate population haplotype pool by 
adding every unique stitched-haplotype resulted from ligating the haplo- 
types of the two shorter blocks in every individual. When the overlapping 
regions of a pair of atomistic haplotypes in an individual are consistent, 
ligation to a longer haplotype is trivially a merging of the two overlapping 
haplotypes, and this avoids generating all combinations of the atomistic 
haplotypes from each block. Only when the overlapping regions in an indi- 
vidual are inconsistent, we grow the haplotype space of the merged blocks 
by including all possible ligations consistent with the atomistic haplotypes 
and the individual genotype. For example, suppose a particular individ- 
ual's haplotypes were recovered as 000100/100010 at loci 1 to 6 for the 
first block, and 110000/000100 at loci 4 to 9 for the next block, and three 
SNPs are overlapping in the two blocks. Then to accommodate the discrep- 
ancy on the 4th and 5th SNPs, we have four possible haplotypes, 10, 01, 
00, 11, for these two loci; for the remaining parts of the region covered by 
these two blocks, that is, loci 1-3 and loci 6-9, we have two haplotypes 
(which are from the atomistic haplotypes determined in the previous iter- 
ation) for each of them. So a combination of all these possibilities will add 
the following sixteen haplotypes to the population haplotype space for the 
ligated segment: 

000100000/010010100, 000110000/100000100, 000010000/100100100, 
000000000/100110100, 000100100/100010000, 000110100/100000000, 
000010100/100100000, 000000100/100110000. 

Under the newly formed population haplotype space at each ligation itera- 
tion, we again apply a Gibbs sampler as in step 2 to determine the individual 
haplotypes of all remaining unphased individuals over the ligated block un- 
der a fixed-dimensional Dirichlet prior of the haplotype frequencies in this 
trimmed haplotype space. We continue this process hierarchically until there 
is only one block left. Since each time we only employ overlapping regions 
of size r, the number of steps needed to complete the ligation of a whole 
sequence is comparable to Niu et al. 's hierarchical PL scheme. 
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SUPPLEMENTARY MATERIAL 

More results on sensitivity analysis and description of real data (DOI: 
10.1214/08-AOAS225SUPPB; .pdf). We provide the sensitivity analysis re- 
sult to the hyper-parameters of HDP on diverse dataset {0 = 0.05), and the 
details of the real data from 10 HapMap ENCODE regions used in Figure 
7. 
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